Docker Containers Crashed

The problem

Today I logged in to the test machine to keep playing with Docker, and found this:

[Image: Docker error output]

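Every container showed “Exited (137) xx hours ago”.

Suspecting a Docker error

I assumed Docker itself was reporting the error, but searching online turned up nothing.

Then I stumbled on this post: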

http://stackoverflow.com/questions/31297616/what-is-the-authoritative-list-of-docker-run-exit-codes

The key part reads:

You mean the exit status shown in docker ps when a container completes? That's the exit status of the process, so it's completely application dependent i.e:

$ docker run debian sh -c "exit 5;"
$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                     PORTS               NAMES
7fcc37778df0        debian              "sh -c 'exit 5;'"   4 seconds ago       Exited (5) 3 seconds ago                       reverent_einstein

Whilst it's true that the docker client or docker server may also throw an error, the status codes aren't documented to the best of my knowledge. Open a issue on the GitHub project if you feel strongly about it.

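The gist: the exit code shown for a container is not reported by Docker. It is the exit status of the program that was running inside the container.

The error comes from the Linux system

So I turned my attention to the programs inside the containers. Since every container had the same exit code, I went looking for Linux system-level exit codes.

I found this article: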

http://tldp.org/LDP/abs/html/exitcodes.html

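Roughly, it says that exit codes range from 0 to 255, and that codes above 128 have the form 128+n, meaning fatal error signal “n”.

Note the example in that article: “kill -9 $PPID of script”, annotated with “$? returns 137 (128 + 9)”.

Look, look!!! There is 137!!!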

Exit code 137 means 128 + 9, where 9 is the number of SIGKILL, the signal that kill -9 sends.
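The 128 + n rule is easy to check in a shell without Docker. The following is a minimal sketch, assuming a POSIX-style `sh` is available: a child shell kills itself with signal 9, and the parent recovers the signal number from the exit status. The variable names `status` and `signal` are only for illustration.

```shell
# Run a child shell that sends SIGKILL (signal 9) to itself.
# "|| status=$?" captures the status even if the caller uses "set -e".
status=0
sh -c 'kill -9 $$' || status=$?

# For a process killed by a signal, the shell reports 128 + signal number.
signal=$((status - 128))

echo "exit status: $status"
echo "signal number: $signal"
```

The same status is what `docker ps -a` shows as “Exited (137)” when the main process of a container is killed with SIGKILL.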

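In other words, at some point, for some reason, all the processes (or perhaps the main Docker daemon) were killed at the same time by a kill -9 signal sent by the system…

When does the kernel send SIGKILL to an application?

Looking for the cause, I found a post from someone who had hit a similar problem: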

http://bbs.chinaunix.net/thread-3670052-1-1.html

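The essence of it is:

The system was probably out of memory; the kernel proactively kills some processes to reclaim memory.
There is also the load-balance issue: even if the system as a whole appears to have plenty of memory, a poorly distributed load can still overload one core and get processes killed.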
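If the out-of-memory theory is right, there are two places where traces of it might show up. The sketch below wraps both in small helper functions; the function names are hypothetical, and the container name or ID passed in is whatever `docker ps -a` lists. It assumes the `docker` CLI and `dmesg` are available on the host.

```shell
# Print the exit code and the OOMKilled flag that Docker recorded
# for a container. $1 is a container name or ID.
container_exit_info() {
    docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' "$1"
}

# Search the kernel log for OOM-killer activity.
# Reading the kernel log may require root on some systems.
kernel_oom_events() {
    dmesg | grep -i -E 'out of memory|killed process'
}
```

The `OOMKilled` flag may well stay `false` when the host as a whole ran out of memory rather than a single container hitting its own limit, so the kernel log is worth checking as well. Older entries can also have rotated out of the kernel ring buffer by the time anyone looks.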

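Checking the monitoring to confirm the theory

I looked up the server load in our monitoring for that point in time. Sure enough, memory was tight. Cause found.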